Objectives:

  1. Examine residuals

  2. Learn about leverage and influential observations

  3. Straighten scatterplots

1 Examining Residuals

Read in the data set Penguins.csv using read.csv() and store the results in penguins. Use the clean_names() function from the janitor package on penguins to get consistent variable names. From this point forward, we will make a habit of using clean_names() for consistent variable names in our data frames.

library(janitor)    # consistent variable names
library(tidyverse)  # ggplot2, dplyr, and the pipe, used throughout
penguins <- read.csv("Penguins.csv") %>% 
  clean_names()
knitr::kable(head(penguins))
| dive_heart_rate | depth_m | duration_min | bird |
|----------------:|--------:|-------------:|:-----|
|            88.8 |     5.0 |     1.050000 | EP19 |
|           103.4 |     9.0 |     1.183333 | EP19 |
|            97.4 |    22.0 |     1.916667 | EP19 |
|            85.3 |    25.5 |     3.466667 | EP19 |
|            60.6 |    30.5 |     7.083333 | EP19 |
|            77.6 |    32.5 |     4.766667 | EP19 |

Create a scatterplot of dive_heart_rate versus duration_min that resembles Figure 8.1 on page 240 of your textbook.

ggplot(data = penguins, aes(x = duration_min, y = dive_heart_rate)) + 
  geom_point(color = "brown") + 
  theme_bw() + 
  labs(x = "Duration (minutes)", y = "Dive heart rate (beats per minute)")

Note the shape of the relationship in the scatterplot.

Find the least squares line for regressing dive_heart_rate onto duration_min and store the results in mod_pen.

mod_pen <- lm(dive_heart_rate ~ duration_min, data = penguins)
summary(mod_pen)

Call:
lm(formula = dive_heart_rate ~ duration_min, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.358  -8.356  -2.933  10.770  43.022 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    96.902      2.601   37.26   <2e-16 ***
duration_min   -5.468      0.311  -17.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.11 on 123 degrees of freedom
Multiple R-squared:  0.7153,    Adjusted R-squared:  0.713 
F-statistic:   309 on 1 and 123 DF,  p-value: < 2.2e-16

Plot the residuals of mod_pen versus duration_min so that the result resembles Figure 8.2 on page 240 of your textbook. Start by creating a new data frame (NDF) from mod_pen using the augment() function from the broom package.

library(broom)
NDF <- augment(mod_pen) %>% 
  clean_names()
knitr::kable(head(NDF))
| dive_heart_rate | duration_min |   fitted |     resid |       hat |    sigma |    cooksd |  std_resid |
|----------------:|-------------:|---------:|----------:|----------:|---------:|----------:|-----------:|
|            88.8 |     1.050000 | 91.16056 | -2.360556 | 0.0270430 | 14.16888 | 0.0003996 | -0.1695718 |
|           103.4 |     1.183333 | 90.43149 | 12.968513 | 0.0262406 | 14.12050 | 0.0116840 |  0.9312164 |
|            97.4 |     1.916667 | 86.42160 | 10.978397 | 0.0221361 | 14.13485 | 0.0070043 |  0.7866580 |
|            85.3 |     3.466667 | 77.94617 |  7.353832 | 0.0151798 | 14.15465 | 0.0021248 |  0.5250752 |
|            60.6 |     7.083333 | 58.17015 |  2.429848 | 0.0080252 | 14.16882 | 0.0001209 |  0.1728681 |
|            77.6 |     4.766667 | 70.83774 |  6.762262 | 0.0111452 | 14.15716 | 0.0013084 |  0.4818501 |
ggplot(data = NDF, aes(x = duration_min, y = resid)) +
  geom_point(color = "green4") + 
  theme_bw() + 
  geom_hline(yintercept = 0, lty = "dashed") + 
  labs(x = "Duration (min)", y = "Residuals")

Fix the model by transforming the response (take the log of dive_heart_rate) and create a scatterplot of the transformed response variable versus duration_min.

# Your Code Goes HERE
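One possible sketch (assumes penguins from above and that ggplot2 and dplyr are loaded; log_dive_heart_rate is a name we choose here for the transformed response):

```r
# Sketch: log-transform the response, then plot it against duration_min
penguins %>% 
  mutate(log_dive_heart_rate = log(dive_heart_rate)) %>%  # natural log
  ggplot(aes(x = duration_min, y = log_dive_heart_rate)) + 
  geom_point(color = "brown") + 
  theme_bw() + 
  labs(x = "Duration (minutes)", y = "log(Dive heart rate)")
```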

Find the least squares estimates using the transformed response and store the results in mod_pen2.

penguins <- penguins %>% 
  mutate(log_dive_heart_rate = log(dive_heart_rate))
mod_pen2 <- lm(log_dive_heart_rate ~ duration_min, data = penguins)
summary(mod_pen2)

Call:
lm(formula = log_dive_heart_rate ~ duration_min, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47142 -0.13589 -0.01389  0.16653  0.44953 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.628122   0.040181  115.18   <2e-16 ***
duration_min -0.093604   0.004805  -19.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.218 on 123 degrees of freedom
Multiple R-squared:  0.7552,    Adjusted R-squared:  0.7532 
F-statistic: 379.5 on 1 and 123 DF,  p-value: < 2.2e-16

Plot the residuals of mod_pen2 versus duration_min.

# Your Code Goes HERE
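A sketch mirroring the residual plot for mod_pen above (NDF2 is a name we choose here; assumes broom, janitor, and ggplot2 are loaded):

```r
# Sketch: augment() mod_pen2, then plot its residuals versus duration_min
NDF2 <- augment(mod_pen2) %>% 
  clean_names()
ggplot(data = NDF2, aes(x = duration_min, y = resid)) +
  geom_point(color = "green4") + 
  theme_bw() + 
  geom_hline(yintercept = 0, lty = "dashed") + 
  labs(x = "Duration (min)", y = "Residuals")
```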

Use mod_pen2 to predict the heart rate of a penguin from a dive of 10 minutes.

predict(mod_pen2, newdata = data.frame(duration_min = 10)) -> ans
ans
       1 
3.692084 
exp(ans)
       1 
40.12837 

Note: mod_pen2 models the log of dive_heart_rate, and log() in R defaults to the natural log (ln), so use exp(ans) to convert the prediction back to the original units (beats per minute).

1.1 Sifting Residuals for Groups

Read in the data set Cereals and regress calories onto sugars and store the model in an object named mod_cer.

# Your Code Goes HERE
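One possible sketch (assumes the file is named Cereals.csv; adjust the name to match your copy):

```r
# Sketch: read in the cereal data and regress calories onto sugars
cereals <- read.csv("Cereals.csv") %>% 
  clean_names()
mod_cer <- lm(calories ~ sugars, data = cereals)
summary(mod_cer)
```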

Create a histogram of the residuals from mod_cer that resembles Figure 8.3 on page 241 of your textbook.

# Your Code Goes HERE
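A sketch of one way to draw the histogram (the binwidth is a guess; tune it until the plot resembles Figure 8.3):

```r
# Sketch: histogram of the residuals from mod_cer
ggplot(data = data.frame(resid = residuals(mod_cer)), aes(x = resid)) + 
  geom_histogram(binwidth = 7.5, fill = "lightblue", color = "black") + 
  theme_bw() + 
  labs(x = "Residuals (calories)", y = "Count")
```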

Use mutate() to create groups similar to Figure 8.4 on page 241 of your textbook.

NDF4 <- cereals %>%
  mutate(resid = residuals(mod_cer), fitted = fitted(mod_cer)) %>% 
  mutate(groups = ifelse(resid <= -25, "low", ifelse(resid <= 25, "med", "high")))
NDF4 %>% 
  filter(groups == "low") -> lowresid
NDF4 %>% 
  filter(groups == "high") -> highresid
knitr::kable(lowresid)
| name                      | mfr | calories | sugars | carbo | protein | fat | sodium | fiber | potass | shelf | middle | shelf_1 | shelf_2 | shelf_3 |     resid |    fitted | groups |
|:--------------------------|:----|---------:|-------:|------:|--------:|----:|-------:|------:|-------:|------:|:-------|--------:|--------:|--------:|----------:|----------:|:-------|
| 100%_Bran                 | N   |       70 |      6 |     5 |       4 |   1 |    130 |    10 |    280 |     3 | No     |       0 |       0 |       1 | -34.55946 | 104.55946 | low    |
| All-Bran                  | K   |       70 |      5 |     7 |       4 |   1 |    260 |     9 |    320 |     3 | No     |       0 |       0 |       1 | -32.07444 | 102.07444 | low    |
| All-Bran_with_Extra_Fiber | K   |       50 |      0 |     8 |       4 |   0 |    140 |    14 |    330 |     3 | No     |       0 |       0 |       1 | -39.64935 |  89.64935 | low    |
| Golden_Crisp              | P   |      100 |     15 |    11 |       2 |   0 |     45 |     0 |     40 |     1 | No     |       1 |       0 |       0 | -26.92463 | 126.92463 | low    |
| Puffed_Rice               | Q   |       50 |      0 |    13 |       1 |   0 |      0 |     0 |     15 |     3 | No     |       0 |       0 |       1 | -39.64935 |  89.64935 | low    |
| Puffed_Wheat              | Q   |       50 |      0 |    10 |       2 |   0 |      0 |     1 |     50 |     3 | No     |       0 |       0 |       1 | -39.64935 |  89.64935 | low    |
knitr::kable(highresid)
| name                           | mfr | calories | sugars | carbo | protein | fat | sodium | fiber | potass | shelf | middle | shelf_1 | shelf_2 | shelf_3 |    resid |   fitted | groups |
|:-------------------------------|:----|---------:|-------:|------:|--------:|----:|-------:|------:|-------:|------:|:-------|--------:|--------:|--------:|---------:|---------:|:-------|
| Just_Right_Fruit_&_Nut         | K   |      140 |      9 |    20 |       3 |   1 |    170 |     2 |     95 |     3 | No     |       0 |       0 |       1 | 27.98548 | 112.0145 | high   |
| Muesli_Raisins,Dates,&_Almonds | R   |      150 |     11 |    16 |       4 |   3 |     95 |     3 |    170 |     3 | No     |       0 |       0 |       1 | 33.01544 | 116.9846 | high   |
| Muesli_Raisins,Peaches,&_Pecans| R   |      150 |     11 |    16 |       4 |   3 |    150 |     3 |    170 |     3 | No     |       0 |       0 |       1 | 33.01544 | 116.9846 | high   |
| Mueslix_Crispy_Blend           | K   |      160 |     13 |    17 |       3 |   2 |    150 |     3 |    160 |     3 | No     |       0 |       0 |       1 | 38.04541 | 121.9546 | high   |
| Nutri-Grain_Almond-Raisin      | K   |      140 |      7 |    21 |       3 |   2 |    220 |     3 |    130 |     3 | No     |       0 |       0 |       1 | 32.95552 | 107.0445 | high   |
ggplot(data = NDF4, aes(x = fitted, y = resid, color = groups)) + 
  geom_point()+ 
  theme_bw()

Create a scatterplot of calories versus sugars color coded by groups. Add the least squares lines for the three different color coded groups to the scatterplot.

# Your Code Goes HERE
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw()

Explain the interaction and additive models below in class.

mod_gro <- lm(calories ~ sugars*groups, data = NDF4)
summary(mod_gro)

Call:
lm(formula = calories ~ sugars * groups, data = NDF4)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6518  -4.8504  -0.5769   5.1496  22.2832 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      112.6923    19.5002   5.779 1.85e-07 ***
sugars             3.4615     1.8747   1.846  0.06899 .  
groupslow        -62.1923    20.0090  -3.108  0.00271 ** 
groupsmed        -15.0405    19.6033  -0.767  0.44548    
sugars:groupslow  -0.1154     1.9840  -0.058  0.95379    
sugars:groupsmed  -2.0283     1.8909  -1.073  0.28704    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.55 on 71 degrees of freedom
Multiple R-squared:  0.8201,    Adjusted R-squared:  0.8074 
F-statistic: 64.74 on 5 and 71 DF,  p-value: < 2.2e-16
# Note groups high is the base
mod_gro$coefficients
     (Intercept)           sugars        groupslow        groupsmed 
     112.6923077        3.4615385      -62.1923077      -15.0404602 
sugars:groupslow sugars:groupsmed 
      -0.1153846       -2.0283261 
# LSE for high group
b0h <- mod_gro$coefficients[1]
b1h <- mod_gro$coefficients[2]
c(b0h, b1h)
(Intercept)      sugars 
 112.692308    3.461538 
# LSE for med group
b0m <- mod_gro$coefficients[1] + mod_gro$coefficients[4]
b1m <- mod_gro$coefficients[2] + mod_gro$coefficients[6]
c(b0m, b1m)
(Intercept)      sugars 
  97.651847    1.433212 
# LSE for low group
b0l <- mod_gro$coefficients[1] + mod_gro$coefficients[3]
b1l <- mod_gro$coefficients[2] + mod_gro$coefficients[5]
c(b0l, b1l)
(Intercept)      sugars 
  50.500000    3.346154 
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() +
  geom_abline(slope = b1h, intercept = b0h, size = 0.25) + 
  geom_abline(slope = b1m, intercept = b0m, size = 0.25) + 
  geom_abline(slope = b1l, intercept = b0l, size = 0.25)

1.2 Parallel slopes model

Load the moderndive package to use the geom_parallel_slopes() function to graph parallel lines. Create a scatterplot of calories versus sugars color coded by groups. Add parallel lines to the scatterplot.

library(moderndive)
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) + 
  geom_point() + 
  geom_parallel_slopes(se = FALSE) + 
  theme_bw()

Explain the code below and how to get the intercepts and slopes for an additive model.

# Parallel slope model
mod_ps <- lm(calories ~ sugars + groups, data = NDF4)
summary(mod_ps)

Call:
lm(formula = calories ~ sugars + groups, data = NDF4)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.0031  -6.2125  -0.8984   7.1906  20.5938 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  130.644      4.676  27.941  < 2e-16 ***
sugars         1.702      0.239   7.118 6.28e-10 ***
groupslow    -73.017      5.581 -13.083  < 2e-16 ***
groupsmed    -34.850      4.211  -8.275 4.28e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.921 on 73 degrees of freedom
Multiple R-squared:  0.7986,    Adjusted R-squared:  0.7904 
F-statistic: 96.51 on 3 and 73 DF,  p-value: < 2.2e-16
# Note groups high is the base
mod_ps$coefficients
(Intercept)      sugars   groupslow   groupsmed 
 130.643917    1.701577  -73.017416  -34.850289 
# LSE for high group
b0h <- mod_ps$coefficients[1]
b1h <- mod_ps$coefficients[2]
c(b0h, b1h)
(Intercept)      sugars 
 130.643917    1.701577 
# LSE for med group
b0m <- mod_ps$coefficients[1] + mod_ps$coefficients[4]
b1m <- mod_ps$coefficients[2]
c(b0m, b1m)
(Intercept)      sugars 
  95.793627    1.701577 
# LSE for low group
b0l <- mod_ps$coefficients[1] + mod_ps$coefficients[3]
b1l <- mod_ps$coefficients[2]
c(b0l, b1l)
(Intercept)      sugars 
  57.626501    1.701577 
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) + 
  geom_point() + 
  geom_parallel_slopes(se = FALSE) + 
  theme_bw() +
  geom_abline(slope = b1h, intercept = b0h, size = 0.25) + 
  geom_abline(slope = b1m, intercept = b0m, size = 0.25) + 
  geom_abline(slope = b1l, intercept = b0l, size = 0.25)

1.3 Book Figure 8.5

Recreate Figure 8.5 from page 242 of the text.

cereals <- cereals %>% 
  mutate(shelf = ifelse(shelf == 1, "bottom", ifelse(shelf == 2, "middle", "top")))
ggplot(data = cereals, aes(x = sugars, y = calories, color = shelf)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(x = "Sugar (g)", y = "Calories")

1.4 Example 8.1 — Extrapolation: Reaching Beyond the Data

Read in the Marriage_age_2017.csv data set and reproduce the analysis in Example 8.1. Hint: use mutate(time = ifelse(year <= 1940, "FP", ifelse(year <= 1979, "MP", "TP"))) to create a new variable time with three time periods.

ma2017 <- read.csv("Marriage_age_2017.csv") %>% 
  clean_names()
ma2017 <- ma2017 %>% 
  mutate(time = ifelse(year <= 1940, "FP", ifelse(year <= 1979, "MP", "TP")))
# Your Code Goes HERE
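A sketch of the Example 8.1 plot, with separate least squares lines per time period. The response column name is an assumption here (we call it age; check names(ma2017) after clean_names() and adjust):

```r
# Sketch: age at marriage vs year, color coded by time period
# NOTE: "age" is an assumed column name; match it to the actual file
ggplot(data = ma2017, aes(x = year, y = age, color = time)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(x = "Year", y = "Age at first marriage (years)")
```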

1.5 Outliers, Leverage, and Influence

Read in the Election_2000.csv data and recreate Figure 8.9 from page 246 of your text.

election2000 <- read.csv("Election_2000.csv") %>% 
  clean_names()
p1 <- ggplot(data = election2000, aes(x = nader, y = buchanan, color = county)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Nader (votes)", y = "Buchanan (votes)") + 
  guides(color = "none") # remove legend (color = FALSE is deprecated in current ggplot2)
p1

library(plotly) # create interactive graph
p2 <- ggplotly(p1)
p2
election2000_PB <- election2000 %>% 
  filter(county != "PALM BEACH")
dim(election2000_PB)
[1] 66  5
mod_no_pb <- lm(buchanan ~ nader, data = election2000_PB)
dim(election2000)
[1] 67  5
ggplot(data = election2000, aes(x = nader, y = buchanan)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Nader (votes)", y = "Buchanan (votes)") + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_abline(intercept = coefficients(mod_no_pb)[1], slope = coefficients(mod_no_pb)[2])

1.6 Lurking Variables and Causation

Read in the Doctors_and_life_expectancy.csv data set and recreate Figures 8.13 and 8.14 from page 249 of your text.

dale <- read.csv("Doctors_and_life_expectancy.csv") %>% 
  clean_names()
names(dale)
[1] "country"             "life_exp"            "sqrt_tv_person"     
[4] "sqrt_doctors_person"
ggplot(data = dale, aes( x = sqrt_doctors_person, y = life_exp)) + 
  geom_point(color = "brown") + 
  theme_bw() + 
  labs(x = expression(sqrt(Doctors/person)), y = "Life Expectancy (yr)", title = "Figure 8.13")

ggplot(data = dale, aes( x = sqrt_tv_person, y = life_exp)) + 
  geom_point(color = "blue") + 
  theme_bw() + 
  labs(x = expression(sqrt(TVs/person)), y = "Life Expectancy (yr)", title = "Figure 8.14")

Comment: What should we do to increase life expectancy?

1.7 Example 8.2

Read in the Dirt_bikes_2014.csv data set and store the result in db2014.

# Your Code Goes HERE
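One possible sketch, following the same read-in pattern as the earlier data sets:

```r
# Sketch: read in the dirt bike data with consistent variable names
db2014 <- read.csv("Dirt_bikes_2014.csv") %>% 
  clean_names()
```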

Create a scatterplot of msrp versus displacement.

# Your Code Goes HERE
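A sketch (assumes db2014 from above; the axis labels and units are guesses):

```r
# Sketch: MSRP versus engine displacement
ggplot(data = db2014, aes(x = displacement, y = msrp)) + 
  geom_point(color = "purple") + 
  theme_bw() + 
  labs(x = "Displacement (cc)", y = "MSRP ($)")
```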

Regress msrp onto displacement and store the result in mod_db.

mod_db <- lm(msrp ~ displacement, data = db2014)
summary(mod_db)

Call:
lm(formula = msrp ~ displacement, data = db2014)

Residuals:
    Min      1Q  Median      3Q     Max 
-2520.8 -1134.2   213.6  1026.1  2350.3 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2856.8362   250.2051   11.42   <2e-16 ***
displacement   15.2567     0.9348   16.32   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1335 on 112 degrees of freedom
Multiple R-squared:  0.704, Adjusted R-squared:  0.7013 
F-statistic: 266.4 on 1 and 112 DF,  p-value: < 2.2e-16

Create a scatterplot of the residuals for mod_db versus the fitted values using the values in NDF5.

NDF5 <- augment(mod_db) %>% 
  clean_names()
# Your Code Goes HERE
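A sketch mirroring the earlier residual plots, using the fitted and resid columns that augment() plus clean_names() produce in NDF5:

```r
# Sketch: residuals versus fitted values for mod_db
ggplot(data = NDF5, aes(x = fitted, y = resid)) + 
  geom_point() + 
  theme_bw() + 
  geom_hline(yintercept = 0, lty = "dashed") + 
  labs(x = "Fitted values ($)", y = "Residuals ($)")
```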

Add the variable displacement_third to the db2014 data frame by raising displacement to the 1/3 power.

db2014 <- db2014 %>% 
  mutate(displacement_third = displacement^(1/3))

Create a new data frame NDF6 using the filter() function that contains only Liquid or Air cooled engines.

NDF6 <- db2014 %>% 
  filter(air_cooled == "Liquid" | air_cooled == "Air")

Create a graph similar to the second graph on page 251 of your textbook using the data in NDF6.

# Your Code Goes HERE
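One possible sketch, plotting msrp against the cube root of displacement and color coding by the air_cooled variable filtered above:

```r
# Sketch: msrp vs cube root of displacement, by engine cooling type
ggplot(data = NDF6, aes(x = displacement_third, y = msrp, color = air_cooled)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(x = expression(sqrt(Displacement, 3)), y = "MSRP ($)")
```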

Modify the code from the previous graph to create a scatterplot with parallel lines for the different engine types.

# Your Code Goes HERE
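A sketch of the parallel slopes version, swapping geom_smooth() for geom_parallel_slopes() from moderndive as in the cereal example:

```r
# Sketch: parallel slopes (additive) model for the two engine types
ggplot(data = NDF6, aes(x = displacement_third, y = msrp, color = air_cooled)) + 
  geom_point() + 
  geom_parallel_slopes(se = FALSE) + 
  theme_bw() + 
  labs(x = expression(sqrt(Displacement, 3)), y = "MSRP ($)")
```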

1.8 Re-Expression for Regression

Read in the Hand_dexterity.csv data set and create a scatterplot of time_sec versus age_yr that resembles Figure 8.20 on page 254 of your textbook.

hd <- read.csv("Hand_dexterity.csv") %>% 
  clean_names() %>% 
  mutate(dominant = ifelse(dominant == 0, "Nondominant hand", "Dominant hand"))
# Your Code Goes HERE
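One possible sketch (assumes hd from above, with the recoded dominant variable):

```r
# Sketch: completion time versus age, color coded by hand
ggplot(data = hd, aes(x = age_yr, y = time_sec, color = dominant)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "Age (yr)", y = "Time (sec)")
```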

Create a scatterplot of speed versus age_yr that resembles Figure 8.21 on page 254 of your textbook.

# Your Code Goes HERE
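A sketch, assuming speed must be created as the reciprocal of time_sec (skip the mutate() if the file already contains a speed column):

```r
# Sketch: re-expressed response (speed = 1/time) versus age
hd %>% 
  mutate(speed = 1 / time_sec) %>%   # assumed re-expression; drop if speed exists
  ggplot(aes(x = age_yr, y = speed, color = dominant)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(x = "Age (yr)", y = "Speed (1/sec)")
```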